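Cat-versus-dog classification is a classic problem in computer vision. Telling cats from dogs in a picture is something a two-year-old child can do with ease, yet for a long time it was an obstacle that machine learning could not overcome. That changed after Geoffrey Hinton, often called the father of deep learning, brought multi-layer neural networks into machine learning 30 years ago. His work laid the foundation for the progress of deep learning over the last 10 years and made a problem that had troubled practitioners for years tractable. Many deep learning models for image recognition have appeared in recent years, and computer vision has become one of the most active areas of artificial intelligence research.
The data science competition platform Kaggle hosts a practice competition for machine learning enthusiasts, Cats vs. Dogs. For this competition Kaggle provides 25,000 cat and dog images as the training set and 12,500 cat and dog images as the test set.
To improve model performance, this project additionally uses 7,393 cat and dog images from The Oxford-IIIT Pet Dataset. These images are added to the training set.
Most of the model training for this project was done on an AWS p3.2xlarge instance. Training and tuning took 70 hours in total, including 57 hours on p3.2xlarge and 3 hours on p2.xlarge.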
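The overall approach is to build on the pretrained models that ship with Keras.
I started with VGG16, but the log loss on the validation set did not change. Adjusting the activation function, the learning rate and the output-layer activation did not help. Only after adding a GlobalAveragePooling2D layer did the log loss start to fall from epoch to epoch and training proceed normally, but the model still performed poorly.
Next I tried ResNet50 with PReLU activation, trained on all of the Kaggle data. The validation log loss was 0.0493 with an accuracy of 0.9828, and the Kaggle score was 0.12, which did not meet the project requirement. I then switched the activation to ELU, set the fully connected layer to 256 units, changed the optimizer to Nadam and the learning rate to 0.0005. The validation log loss dropped to 0.0368 with an accuracy of 0.9867, and the Kaggle score was 0.09334, still short of the requirement.
Further tuning did not improve the model, so I decided to enlarge the training set with the 7,393 images from The Oxford-IIIT Pet Dataset. The validation log loss dropped to 0.03286 with an accuracy of 0.9882, and the Kaggle score was 0.07519, still short of the requirement. Data augmentation with ImageDataGenerator did not improve the model either.
Finally I tried model fusion. Combining ResNet50, InceptionV3 and Xception gave a validation log loss of 0.0069, an accuracy of 0.9985 and a Kaggle score of 0.0417. While learning about model fusion I came across clipping, which restricts the predicted values to a reasonable interval and can noticeably improve the Kaggle score. Applying clipping to the earlier single model gave a Kaggle score of 0.05719, which also meets the project requirement. For this reason both models are kept in this project.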
import os, cv2, random, pickle
from tqdm import tqdm
import numpy as np
import pandas as pd
import csv
import shutil
import h5py
from urllib.request import urlretrieve
from os.path import isfile, isdir
import utils
from utils import *
import tarfile
import matplotlib.pyplot as plt
from matplotlib import ticker
%matplotlib inline
from keras.layers import Input, Dropout, Flatten, Dense, Activation, GlobalAveragePooling2D
from keras.optimizers import RMSprop,Nadam,SGD,Adam
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping, CSVLogger, LearningRateScheduler, ReduceLROnPlateau
from keras.utils import np_utils
from keras.models import load_model
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU, ELU
from keras.applications.vgg16 import VGG16
from keras.applications.xception import Xception
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.applications.inception_v3 import InceptionV3
from keras.models import Model
from keras import layers
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
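Because the models run on Linux, the competition data is downloaded from the command line through the Kaggle API with the command: kaggle competitions download -c dogs-vs-cats-redux-kernels-edition. The download requires a Kaggle account, so the kaggle.json credentials file has to be configured first.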
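As a rough illustration of the kaggle.json setup mentioned above, the sketch below checks a credentials file before the download is attempted. It assumes the usual Kaggle API conventions (a JSON file at ~/.kaggle/kaggle.json with the fields username and key, readable only by its owner). The function name check_kaggle_credentials is hypothetical and is not part of the Kaggle API or of this project's utils.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return ['{} does not exist'.format(path)]
    with open(path) as f:
        try:
            config = json.load(f)
        except ValueError:
            return ['{} is not valid JSON'.format(path)]
    if not isinstance(config, dict):
        return ['{} does not contain a JSON object'.format(path)]

    problems = []
    # The API token file is expected to carry these two fields.
    for field in ('username', 'key'):
        if not config.get(field):
            problems.append('missing field: {}'.format(field))
    # On POSIX systems the token should not be accessible to group or others.
    if os.name == 'posix':
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append('permissions too open; run: chmod 600 {}'.format(path))
    return problems


default_path = os.path.join(os.path.expanduser('~'), '.kaggle', 'kaggle.json')
print(check_kaggle_credentials(default_path) or 'kaggle.json looks fine')
```

An empty result means the file is in place and the kaggle command should be able to authenticate; any listed problem is worth fixing before starting a long download.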
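The Kaggle dataset is downloaded as a compressed archive; extracting it yields a training folder and a test folder. The training folder contains 25,000 color images, 12,500 cats and 12,500 dogs, and each file is named with its label and a file number. The test folder contains 12,500 shuffled cat and dog images in random order, and each file is named with a number only.
The Oxford-IIIT Pet dataset is downloaded directly with urlretrieve and then opened with tarfile.open.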
# Download the Oxford-IIIT Pet Dataset as supplementary training data
image_supply_path = './input/images'
tar_gz_path = './images.tar.gz'

class DLProgress(tqdm):
    '''tqdm progress bar that can be passed to urlretrieve as a report hook'''
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

# Download the archive only if it is not already present
if not isfile(tar_gz_path):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc='images supply dataset') as pbar:
        urlretrieve('http://www.robots.ox.ac.uk/%7Evgg/data/pets/data/images.tar.gz', tar_gz_path, pbar.hook)

# Extract the archive only if the image folder does not exist yet
if not isdir(image_supply_path):
    with tarfile.open(tar_gz_path) as tar:
        tar.extractall(path='./input/')

# test_folder_path is a helper from utils
test_folder_path(image_supply_path)
TRAIN_DIR = './input/train/'
train_images_path = [TRAIN_DIR+i for i in os.listdir(TRAIN_DIR)]

# Show 9 random images from the Kaggle training set
random.seed(2018)

def random_show(location):
    plt.subplot(location)
    sample = random.choice(train_images_path)
    img = cv2.imread(sample)
    b,g,r = cv2.split(img)  # reorder channels: BGR → RGB
    rgb_img = cv2.merge([r,g,b])
    plt.title(sample)
    plt.imshow(rgb_img)

plt.figure(figsize=(12,12))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
for location in range(331, 340):
    random_show(location)
plt.show()
# Show 9 random images from the supplementary dataset
random.seed(2018)
TRAIN_SUP_DIR = './input/images/'
train_images_path_sup = [TRAIN_SUP_DIR + file for file in os.listdir(TRAIN_SUP_DIR)]

def random_show(location):
    plt.subplot(location)
    sample = random.choice(train_images_path_sup)
    img = cv2.imread(sample)
    b,g,r = cv2.split(img)  # reorder channels: BGR → RGB
    rgb_img = cv2.merge([r,g,b])
    plt.title(sample)
    plt.imshow(rgb_img)

plt.figure(figsize=(12,12))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
for location in range(331, 340):
    random_show(location)
plt.show()
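The figures above show that the images do not share a common width and height, so they have to be resized to a uniform size before training.
In addition, not every image is a close-up of a cat or a dog. The backgrounds are fairly complex and the lighting varies, so care must be taken to prevent overfitting during training.
The Oxford-IIIT Pet files are not named dog or cat but by a finer-grained breed name, so the file names have to be converted. The Oxford-IIIT Pet website documents which breeds are cats and which are dogs, which makes this conversion possible.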
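The renaming helper used in this project lives in utils and is not shown here, so the following is only a minimal sketch of how such a mapping could look. It assumes the convention stated in the dataset's annotation notes, namely that file names starting with a capital letter are cat breeds and names starting with a lowercase letter are dog breeds. The function names species_from_filename and renamed, and the start parameter, are hypothetical.

```python
import os


def species_from_filename(filename):
    """Map an Oxford-IIIT Pet file name to 'cat' or 'dog' by its first letter."""
    base = os.path.basename(filename)
    # Assumed convention: capitalized breed names are cats, lowercase ones are dogs.
    return 'cat' if base[0].isupper() else 'dog'


def renamed(filenames, start=0):
    """Yield (old_name, new_name) pairs such as ('Abyssinian_1.jpg', 'cat.0.jpg')."""
    counters = {'cat': start, 'dog': start}
    for name in sorted(filenames):
        species = species_from_filename(name)
        ext = os.path.splitext(name)[1]
        yield name, '{}.{}{}'.format(species, counters[species], ext)
        counters[species] += 1


examples = ['Abyssinian_1.jpg', 'beagle_12.jpg', 'Bengal_7.jpg', 'pug_3.jpg']
for old, new in renamed(examples):
    print(old, '->', new)
```

If the renamed files are later copied into the same folder as the Kaggle images, a start value above the largest Kaggle file number (for example start=20000) would avoid name collisions.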
Next, we visualize the width and height distribution of the whole dataset.
# Histograms of image height and width: Kaggle training set
height = []
width = []
for file in tqdm(train_images_path):
    image = cv2.imread(file)
    height.append(image.shape[0])
    width.append(image.shape[1])

plt.figure(figsize=(12,6))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
plt.subplot(121)
plt.hist(height)
plt.title("height distribution")
plt.subplot(122)
plt.hist(width)
plt.title("width distribution")
plt.show()
print('median of height: {}'.format(np.median(height)))
print('median of width: {}'.format(np.median(width)))
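The output above shows that the median width and height of the Kaggle training images are (447, 374). Resizing the images to (350, 350), slightly below the median size, is therefore a reasonable choice for model performance.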
# Histograms of image height and width: supplementary dataset
height = []
width = []
for file in tqdm(train_images_path_sup):
    image = cv2.imread(file)
    if image is not None:  # skip files that cv2 cannot read
        height.append(image.shape[0])
        width.append(image.shape[1])

plt.figure(figsize=(12,6))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
plt.subplot(121)
plt.hist(height)
plt.title("height distribution")
plt.subplot(122)
plt.hist(width)
plt.title("width distribution")
plt.show()
print('median of height: {}'.format(np.median(height)))
print('median of width: {}'.format(np.median(width)))
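The output above shows that the median width and height of the supplementary training images are (500, 375), so resizing them to (350, 350) is a reasonable choice here as well.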
# Scatter plots of image width against height for both datasets
height_kg = []
width_kg = []
height_su = []
width_su = []
for file in train_images_path:
    image = cv2.imread(file)
    height_kg.append(image.shape[0])
    width_kg.append(image.shape[1])
for file in train_images_path_sup:
    image = cv2.imread(file)
    if image is not None:  # skip files that cv2 cannot read
        height_su.append(image.shape[0])
        width_su.append(image.shape[1])

plt.figure(figsize=(12,6))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
plt.subplot(121)
plt.scatter(width_kg, height_kg, s=50)
plt.title("height&width scatter of kaggle dataset")
plt.subplot(122)
plt.scatter(width_su, height_su, s=50)
plt.title("height&width scatter of supply dataset")
plt.show()
print('median of kaggle height: {}, median of kaggle width: {}'.format(np.median(height_kg),np.median(width_kg)))
print('median of supply height: {}, median of supply width: {}'.format(np.median(height_su),np.median(width_su)))
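At the start of the project I chose to solve the task with a single pretrained model. After trying several pretrained models, I settled on ResNet50.
Before the training set can be used, it has to be preprocessed: resizing, shuffling, renaming and removal of abnormal samples.
In the Kaggle competition discussion threads I found a CSV file of abnormal images compiled by another participant. This file is used to remove abnormal samples from the training set.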
# Build the list of abnormal images from relabel.csv
random.seed(2018)
ab_img_list = []
with open('relabel.csv') as f:
    for row in csv.reader(f):
        name = row[1]
        # Keep only rows whose second column is a dog/cat file name
        if 'dog' in name or 'cat' in name:
            if '.jpg' not in name:
                name = name + '.jpg'
            ab_img_list.append('./input/train/' + name)

# Show 9 random abnormal images
def random_show(location):
    plt.subplot(location)
    sample = random.choice(ab_img_list)
    img = cv2.imread(sample)
    b,g,r = cv2.split(img)  # reorder channels: BGR → RGB
    rgb_img = cv2.merge([r,g,b])
    plt.title(sample)
    plt.imshow(rgb_img)

plt.figure(figsize=(12,12))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
for location in range(331, 340):
    random_show(location)
plt.show()
# Remove the abnormal images from the training list.
# A new list is built because calling list.remove() while iterating over
# the same list skips the element that follows each removed entry.
ab_img_set = set(ab_img_list)
before = len(train_images_path)
train_images_path = [file for file in train_images_path if file not in ab_img_set]
print('deleted {} files.'.format(before - len(train_images_path)))
# Show the two mislabeled images
wrong_label_list = ['./input/train/dog.11731.jpg', './input/train/dog.4334.jpg']
plt.figure(figsize=(8,8))
plt.subplots_adjust(wspace=0.2, hspace=0.2)
for k, path in enumerate(wrong_label_list):
    plt.subplot(121 + k)
    img = cv2.imread(path)
    b,g,r = cv2.split(img)  # reorder channels: BGR → RGB
    rgb_img = cv2.merge([r,g,b])
    plt.title(path)
    plt.imshow(rgb_img)
plt.show()

# Remove the two mislabeled images from the training list
before = len(train_images_path)
train_images_path = [file for file in train_images_path if file not in wrong_label_list]
print('deleted {} files.'.format(before - len(train_images_path)))
# Rename the supplementary files to the dog.x / cat.x pattern
# utils provides a helper for this: trange_file_name
trange_file_name('./input/images/')
# Find supplementary images that cannot be read
TRAIN_DIR_SUP = './input/images/'
train_images_path_sup = [TRAIN_DIR_SUP+i for i in os.listdir(TRAIN_DIR_SUP)]
none_img_list = []
for file in tqdm(train_images_path_sup):
    image = cv2.imread(file)
    if image is None:  # cv2.imread returns None when a file cannot be decoded
        none_img_list.append(file)
        print(file)

# Drop the unreadable images from the supplementary list
before = len(train_images_path_sup)
train_images_path_sup = [file for file in train_images_path_sup if file not in none_img_list]
print('deleted {} files.'.format(before - len(train_images_path_sup)))
# Merge the Kaggle training set with the supplementary training set
train_images = train_images_path + train_images_path_sup
# Shuffle the merged training set
random.seed(2018)
random.shuffle(train_images)
ROWS = 350
COLS = 350
CHANNELS = 3

def read_image(file_path):
    '''
    Read an image, convert it to RGB and resize it to ROWS x COLS
    '''
    img = cv2.imread(file_path, cv2.IMREAD_COLOR)
    # Convert BGR → RGB so that training images use the same channel order as the test images
    rgb_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # cv2.resize expects the target size as (width, height)
    return cv2.resize(rgb_img, (COLS, ROWS), interpolation=cv2.INTER_CUBIC)

def prep_train_data(images_path):
    '''
    Preprocess the training set: one-hot labels plus an array of resized images
    '''
    # One-hot encode the labels: column 0 = dog, column 1 = cat
    labels = np.zeros((len(images_path), 2), dtype=np.uint8)
    for i, path in enumerate(images_path):
        if 'dog' in path:
            labels[i][0] = 1
        else:
            labels[i][1] = 1
    count = len(images_path)
    features = np.ndarray((count, ROWS, COLS, CHANNELS), dtype=np.uint8)
    for i, image_file in enumerate(tqdm(images_path)):
        features[i] = read_image(image_file)
    return features, labels

features, labels = prep_train_data(train_images)
print("features shape: {}".format(features.shape))
print("labels shape: {}".format(labels.shape))
# Save the training data (pickle protocol 4 supports objects larger than 4 GB)
with open('train_data.p', 'wb') as f:
    pickle.dump((features, labels), f, protocol=4)
print('save data done!')

# Load the training data
with open('train_data.p', mode='rb') as f:
    features, labels = pickle.load(f)
print('load data done!')

# Sanity check after loading: show one random image with its label
random.seed(2018)
image_index = random.choice(range(len(features)))
image_file = features[image_index]
plt.imshow(image_file)
plt.title('num:{}'.format(labels[image_index]))
plt.show()
# ResNet50 with ELU
from keras.layers.normalization import BatchNormalization
from keras.layers import GlobalAveragePooling2D
from keras.layers.advanced_activations import PReLU, ELU

# optimizer = RMSprop(lr=1e-4)
# optimizer = SGD(0.001, momentum=0.9, nesterov=True)
optimizer = Nadam(lr=0.0005)
objective = 'binary_crossentropy'

# Pretrained ResNet50 without its classification head; all base layers are frozen
base_model = ResNet50(include_top=False, weights='imagenet')
for layer in base_model.layers:
    layer.trainable = False

# Classifier head: BN → global average pooling → Dense(256) → BN → ELU → Dropout → 2 outputs
head = base_model.output
batchnormed_1 = BatchNormalization(axis=3)(head)
avgpooled = GlobalAveragePooling2D()(batchnormed_1)
dense = Dense(256)(avgpooled)
batchnormed_2 = BatchNormalization()(dense)
elu = ELU()(batchnormed_2)
dropout = Dropout(0.2)(elu)
output = Dense(2, activation='sigmoid')(dropout)

model = Model(base_model.input, output)
model.compile(optimizer=optimizer, loss=objective, metrics=['accuracy'])

# Visualize the model structure with model_to_dot
SVG(model_to_dot(model).create(prog='dot', format='svg'))

# Show the model parameters
model.summary()
# Train the model
random.seed(2018)
nb_epoch = 20
batch_size = 128

# Save the model that performs best on the validation set during training
val_checkpoint = ModelCheckpoint('resnet_bestval_{val_loss:.4f}.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# cur_checkpoint = ModelCheckpoint('current.h5')

# Halve the learning rate when val_loss has not improved for 2 epochs
lrSchduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, cooldown=1, verbose=1)

# Custom callback that records loss and val_loss at the end of every epoch
class LossHistory(Callback):
    def on_train_begin(self, logs=None):
        self.losses = []
        self.val_losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))

# Stop training once val_loss has not improved for 5 epochs, to save training time and limit overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='min')

# shuffle=True shuffles the training samples during training
def run_catdog():
    history = LossHistory()
    model.fit(features, labels, batch_size=batch_size, epochs=nb_epoch,
              validation_split=0.2, verbose=1, shuffle=True,
              callbacks=[history, early_stopping, val_checkpoint, lrSchduler])
    return history

history = run_catdog()
# Visualize the training process
loss = history.losses
val_loss = history.val_losses
plt.figure(figsize=(12,6))
plt.plot(loss, 'blue', label='Training Loss')
plt.plot(val_loss, 'green', label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Trend')
plt.legend()
plt.show()
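# Load the model that performed best on the validation set and use it to predict the test set
# (the file name follows the ModelCheckpoint pattern resnet_bestval_{val_loss:.4f}.h5)
model = load_model('resnet_bestval_0.0329.h5')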
# Preprocess the test set
ROWS = 350
COLS = 350
CHANNELS = 3
TEST_DIR = './test/test1/'
filenames = os.listdir(TEST_DIR)
filenames.sort(key=lambda x:int(x[:-4]))  # sort numerically by file number
test_images_path = [TEST_DIR+i for i in filenames]
test_images_path[:10]

def read_image(file_path):
    '''
    Read an image, convert it to RGB and resize it to ROWS x COLS
    '''
    img = cv2.imread(file_path, cv2.IMREAD_COLOR)
    b,g,r = cv2.split(img)  # reorder channels: BGR → RGB
    rgb_img = cv2.merge([r,g,b])
    # cv2.resize expects the target size as (width, height)
    return cv2.resize(rgb_img, (COLS, ROWS), interpolation=cv2.INTER_CUBIC)

def prep_test_data(images_path):
    '''
    Preprocess the test set into an array of resized images
    '''
    count = len(images_path)
    test_images = np.ndarray((count, ROWS, COLS, CHANNELS), dtype=np.uint8)
    for i, image_file in enumerate(tqdm(images_path)):
        test_images[i] = read_image(image_file)
    return test_images

test_images = prep_test_data(test_images_path)
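# Save the test data (pickle protocol 4 supports objects larger than 4 GB)
with open('test_data_350.p', 'wb') as f:
    pickle.dump(test_images, f, protocol=4)
print('save data done!')

# Load the test data
with open('test_data_350.p', mode='rb') as f:
    test_images = pickle.load(f)
print('load data done!')

# Predict the test set with the trained model
predictions = model.predict(test_images,batch_size=128, verbose=1)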
# Show 9 random predictions
np.random.seed(2018)  # the indices are drawn with np.random, so NumPy's generator is seeded
plt.figure(figsize=(12,12))
plt.subplots_adjust(wspace=0.2, hspace=0.4)
for location in range(331, 340):
    plt.subplot(location)
    i = np.random.randint(low=0, high=len(predictions))
    # Column 0 is the dog probability, column 1 the cat probability
    if predictions[i, 0] >= 0.5:
        title = 'I am {:.3%} sure this is a Dog'.format(predictions[i][0])
    else:
        title = 'I am {:.3%} sure this is a Cat'.format(predictions[i][1])
    plt.title(title)
    plt.imshow(test_images[i])
plt.show()
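# Kaggle scores submissions by log loss, which grows without bound for confident wrong predictions.
# Clipping the predictions to [0.005, 0.995] caps that penalty and noticeably improves the Kaggle score.
predictions = predictions.clip(min=0.005, max=0.995)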
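To make the effect of clipping concrete, here is a small worked example with made-up numbers (not results from this project). It computes the binary log loss for five predictions, one of which is confidently wrong, with and without the [0.005, 0.995] clip. The helper names log_loss and clip are defined locally for this sketch.

```python
import math


def log_loss(y_true, y_pred):
    """Mean binary cross-entropy; y_pred holds the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def clip(values, low, high):
    """Restrict every value to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]


# Illustrative data: four confident correct predictions and one confident mistake.
y_true = [1, 1, 0, 0, 1]
y_pred = [0.9999, 0.9999, 0.0001, 0.0001, 0.0001]

raw = log_loss(y_true, y_pred)
clipped = log_loss(y_true, clip(y_pred, 0.005, 0.995))
print('raw: {:.4f}, clipped: {:.4f}'.format(raw, clipped))
```

With these numbers the single mistake dominates the average: the loss is about 1.84 without clipping and about 1.06 with it. The trade-off is that every correct prediction pays a small extra cost, so clipping only helps when the model makes at least a few confident errors.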
# Write the predictions to a CSV file in the format and file order required by Kaggle
with open('submission_0.0328.csv', 'w') as f:
    f.write('id,label\n')
    for i in tqdm(range(len(predictions))):
        pred = predictions[i, 0]  # column 0 is the dog probability
        f.write('{},{}\n'.format(i+1, pred))
print('file closed!')
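The basic idea of multi-model fusion is as follows: use ImageDataGenerator to feed the images, use predict_generator to obtain the feature vectors of each pretrained model, and concatenate the feature vectors as the input of a new model. Once a classifier is added on top, this model can be trained directly.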
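The fusion step itself is just a concatenation along the feature axis. The sketch below illustrates the shapes with random stand-in data rather than real model outputs. It assumes that ResNet50, Xception and InceptionV3 each yield a 2048-dimensional vector per image after global average pooling; the sample count of 8 is arbitrary.

```python
import numpy as np

rng = np.random.RandomState(2018)
n_samples = 8  # arbitrary number of images for the illustration
model_names = ('ResNet50', 'Xception', 'InceptionV3')

# Stand-ins for the pooled feature vectors of each pretrained model.
feature_sets = [rng.rand(n_samples, 2048).astype(np.float32) for _ in model_names]

# Fusion: place the per-model vectors side by side for every image.
fused = np.concatenate(feature_sets, axis=1)
print(fused.shape)
```

Each row of fused describes one image with 3 x 2048 = 6144 values, which is the input shape the small classifier on top is trained on. The row order must be identical across the three feature sets, which is why the generators are created with shuffle=False.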
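ImageDataGenerator expects the training images to be stored in one folder per class, but neither the Kaggle training set nor the supplementary dataset is organized into cat and dog folders, so the files have to be moved. To leave the data used by the single model untouched, this project copies the files with shutil.copy.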
# Create the directories
import os
import shutil

TTRAIN_DIR = './input/train/'
train_images = os.listdir(TTRAIN_DIR)
train_cats = [file for file in train_images if 'cat' in file]
train_dogs = [file for file in train_images if 'dog' in file]
TEST_DIR = './input/test/'
test_images = [file for file in os.listdir(TEST_DIR)]

def mkdir(path):
    '''Create path if it does not exist; return True if it was created'''
    if not os.path.exists(path):
        os.makedirs(path)
        print(':{} created'.format(path))
        return True
    else:
        print(':{} already exists'.format(path))
        return False

# Directories to create
mkpath_list = ['./mydata2/', './mydata2/train/', './mydata2/train/cats/',
               './mydata2/train/dogs/', './mydata2/validation/', './mydata2/test1/', './mydata2/test1/test/']
for path in mkpath_list:
    mkdir(path)
def copyfile(path, keyword, new_path):
    '''
    Copy every file in the list path whose file name contains keyword
    '''
    i = 0
    for file in path:
        if keyword in os.path.basename(file):
            shutil.copy(file, new_path)
            i = i+1
    print('copied {} {} images to {}'.format(i, keyword, new_path))

def copyfileno(path, new_path):
    '''
    Copy every file in the list path
    '''
    i = 0
    for file in path:
        shutil.copy(file, new_path)
        i = i+1
    print('copied {} images to {}'.format(i, new_path))

TRAIN_DIR = './input/train/'
train_images = [TRAIN_DIR + file for file in os.listdir(TRAIN_DIR)]
TEST_DIR = './input/test/'
test_images = [TEST_DIR + file for file in os.listdir(TEST_DIR)]
TRAIN_SUP_DIR = './input/images/'
train_sup_images = [TRAIN_SUP_DIR + file for file in os.listdir(TRAIN_SUP_DIR)]

# Copy the Kaggle training set, the supplementary dataset and the Kaggle test set into the new folders
copyfile(train_images, 'dog', './mydata2/train/dogs/')
copyfile(train_images, 'cat', './mydata2/train/cats/')
copyfile(train_sup_images, 'dog', './mydata2/train/dogs/')
copyfile(train_sup_images, 'cat', './mydata2/train/cats/')
copyfileno(test_images, './mydata2/test1/test/')
def remove_from_folders(paths, folders=('./mydata2/train/dogs/', './mydata2/train/cats/')):
    '''
    Delete the files named in paths from the given folders; return the number of deleted files
    '''
    # The lists hold full paths, while os.listdir returns bare file names, so compare base names
    names = {os.path.basename(p) for p in paths}
    deleted = 0
    for folder in folders:
        for file in os.listdir(folder):
            if file in names:
                os.remove(folder + file)
                deleted += 1
    return deleted

# Remove the abnormal images from the folders (49 images)
print('deleted {} files.'.format(remove_from_folders(ab_img_list)))

# Remove the mislabeled images from the folders (2 images)
print('deleted {} files.'.format(remove_from_folders(wrong_label_list)))

# Remove the unreadable images of the supplementary dataset
print('deleted {} files.'.format(remove_from_folders(none_img_list)))
from keras.layers import Lambda
from keras.applications import inception_v3, xception

def write_feature_vectors(MODEL, image_size, lambda_func=None):
    '''
    Extract the feature vectors of the training and test data with a Keras pretrained model
    and save them to an h5 file
    '''
    width = image_size[0]
    height = image_size[1]
    input_tensor = Input((height, width, 3))
    x = input_tensor
    if lambda_func:
        x = Lambda(lambda_func)(x)  # model-specific input preprocessing
    base_model = MODEL(input_tensor=x, weights='imagenet', include_top=False)
    model = Model(base_model.input, GlobalAveragePooling2D()(base_model.output))

    # shuffle=False keeps the sample order identical for every model
    gen = ImageDataGenerator()
    train_generator = gen.flow_from_directory("./mydata2/train", image_size, shuffle=False, batch_size=16)
    test_generator = gen.flow_from_directory("./mydata2/test1", image_size, shuffle=False, batch_size=16, class_mode=None)
    train = model.predict_generator(train_generator, verbose=1)
    test = model.predict_generator(test_generator, verbose=1)

    # Output file name: fv_<model name>_<width>.h5
    with h5py.File('fv_{}_{}.h5'.format(MODEL.__name__, width), 'w') as h:
        h.create_dataset("train", data=train)
        h.create_dataset("test", data=test)
        h.create_dataset("label", data=train_generator.classes)  # cats = 0, dogs = 1

# Use each model's default image size
# write_feature_vectors(VGG16, (224, 224))
write_feature_vectors(ResNet50, (224, 224))
write_feature_vectors(InceptionV3, (299, 299), inception_v3.preprocess_input)
write_feature_vectors(Xception, (299, 299), xception.preprocess_input)

# Use 350*350 images
# write_feature_vectors(VGG16, (350, 350))
write_feature_vectors(ResNet50, (350, 350))
write_feature_vectors(InceptionV3, (350, 350), inception_v3.preprocess_input)
write_feature_vectors(Xception, (350, 350), xception.preprocess_input)
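# Model fusion: load the saved feature vectors and concatenate them per image
# The file names must point at the feature files written by write_feature_vectors (fv_<model name>_<width>.h5)
random.seed(2018)
X_train = []
X_test = []
for filename in ["fv_ResNet50.h5", "fv_Xception.h5", "fv_InceptionV3.h5"]:
    with h5py.File(filename, 'r') as h:
        X_train.append(np.array(h['train']))
        X_test.append(np.array(h['test']))
        y_train = np.array(h['label'])  # identical in every file
X_train = np.concatenate(X_train, axis=1)
X_test = np.concatenate(X_test, axis=1)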
# Build the model: a classifier placed directly on top of the fused pretrained features
input_tensor = Input(X_train.shape[1:])
x = Dropout(0.5)(input_tensor)
x = Dense(1, activation='sigmoid')(x)  # single output: probability of class 1 (dog)
model = Model(input_tensor, x)
model.compile(optimizer='adadelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Visualize the model structure with model_to_dot
SVG(model_to_dot(model).create(prog='dot', format='svg'))

# Show the model parameters
model.summary()
# Train the model
nb_epoch = 20
batch_size = 128

# Save the model that performs best on the validation set during training
val_checkpoint = ModelCheckpoint('mix_bestval_{val_loss:.4f}.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# cur_checkpoint = ModelCheckpoint('current.h5')

# Halve the learning rate when val_loss has not improved for 2 epochs
lrSchduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, cooldown=1, verbose=1)

# Custom callback that records loss and val_loss at the end of every epoch
class LossHistory(Callback):
    def on_train_begin(self, logs=None):
        self.losses = []
        self.val_losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))

# Stop training once val_loss has not improved for 5 epochs, to save training time and limit overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='min')

# shuffle=True shuffles the training samples during training
def run_catdog():
    history = LossHistory()
    model.fit(X_train, y_train, batch_size=batch_size, epochs=nb_epoch,
              validation_split=0.2, verbose=1, shuffle=True,
              callbacks=[history, early_stopping, val_checkpoint, lrSchduler])
    return history

history = run_catdog()
# Visualize the training process
loss = history.losses
val_loss = history.val_losses
plt.figure(figsize=(12,6))
plt.plot(loss, 'blue', label='Training Loss')
plt.plot(val_loss, 'green', label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Trend')
plt.legend()
plt.show()
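# Load the model that performed best on the validation set
model = load_model('mix_bestval_0.0069.h5')
# Predict the test set with the trained model
predictions = model.predict(X_test, verbose=1)
# Cache the predictions so that they can be reloaded without rerunning the model
with open('predictions.p', 'wb') as f:
    pickle.dump(predictions, f)
with open('predictions.p', mode='rb') as f:
    predictions = pickle.load(f)
predictions[:10]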
# Recover the numeric file ids in the order in which X_test was generated
gen = ImageDataGenerator()
test_generator = gen.flow_from_directory("./mydata2/test1", (224, 224), shuffle=False, batch_size=16, class_mode=None)
file_list = []
for file in test_generator.filenames:
    # e.g. 'test/123.jpg' → 123; basename and splitext follow the platform's path separator
    file_name = int(os.path.splitext(os.path.basename(file))[0])
    file_list.append(file_name)
file_list = np.array(file_list)
# Show 9 random predictions
np.random.seed(2018)  # the indices are drawn with np.random, so NumPy's generator is seeded
test_images_path = './mydata2/test1/'
plt.figure(figsize=(12,12))
plt.subplots_adjust(wspace=0.2, hspace=0.4)
for location in range(331, 340):
    plt.subplot(location)
    i = np.random.randint(low=0, high=len(predictions))
    # The model outputs the probability that the image shows a dog
    if predictions[i] >= 0.5:
        title = 'I am {:.3%} sure this is a Dog'.format(float(predictions[i]))
    else:
        title = 'I am {:.3%} sure this is a Cat'.format(float(1-predictions[i]))
    plt.title(title)
    file = test_generator.filenames[i]
    img = cv2.imread(test_images_path + file)
    b,g,r = cv2.split(img)  # reorder channels: BGR → RGB
    rgb_img = cv2.merge([r,g,b])
    plt.imshow(rgb_img)
plt.show()
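Because the test files are not read in numeric order of their file names, predictions has to be reordered. The idea is to sort predictions by the order of the file ids that belong to the rows of X_test.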
# Sort file_list in ascending order and reorder predictions accordingly
file_list_index = np.argsort(file_list)
predictions = np.asarray(predictions, dtype=np.float32).ravel()[file_list_index]
predictions[:10]
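The reordering can be checked on a toy example. The sketch below assumes that the generator lists files in lexicographic order, so '10.jpg' comes before '2.jpg'. The file names and prediction values are invented for illustration.

```python
import numpy as np

# Lexicographic file order as a directory generator might return it (assumed).
filenames = ['test/1.jpg', 'test/10.jpg', 'test/2.jpg', 'test/3.jpg']
predictions = np.array([[0.9], [0.1], [0.8], [0.2]], dtype=np.float32)

# Extract the numeric id from each file name.
ids = np.array([int(name.split('/')[-1].split('.')[0]) for name in filenames])

# argsort gives, for each output position, the index of the row to take.
order = np.argsort(ids)
sorted_ids = ids[order]
sorted_predictions = predictions.ravel()[order]

for file_id, pred in zip(sorted_ids, sorted_predictions):
    print(file_id, pred)
```

After the reordering, row k of sorted_predictions belongs to file id sorted_ids[k], so writing rows with ids 1, 2, 3, ... to the submission file pairs every id with the right prediction.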
# Kaggle scores submissions by log loss, which grows without bound for confident wrong predictions.
# Clipping the predictions to [0.005, 0.995] caps that penalty and noticeably improves the Kaggle score.
predictions = predictions.clip(min=0.005, max=0.995)

# Write the predictions to a CSV file in the format and file order required by Kaggle
with open('submission_0.0069.csv', 'w') as f:
    f.write('id,label\n')
    for i in tqdm(range(len(predictions))):
        pred = predictions[i]
        f.write('{},{}\n'.format(i+1, pred))
print('file closed!')